Scrapy: part 1

Scrapy is a powerful web scraping framework for Python. A framework is still a library ("an API of functions"), but with more powerful built-in features: it can be described as a combination of everything we have learned so far, including requests, BeautifulSoup, lxml and RegEx. To install Scrapy, open the command prompt and run the following command:

pip install scrapy

Once Scrapy is installed, one can start experimenting with it by simply running the following command in the command prompt (e.g. let's assume you want to scrape the http://quotes.toscrape.com/page/1/ page):

scrapy shell http://quotes.toscrape.com/page/1/

Now you should be able to apply powerful Scrapy functions to get the data you want. However, all of this is available only inside the command prompt. If you want to do the same inside a Jupyter notebook, you have to mimic the command prompt behaviour by adding a few additional lines, as shown below (instead of running the above-mentioned command). As this material is provided in a Jupyter notebook, we will also mimic the command prompt behaviour, yet you are encouraged to try the shell approach yourself.


In [1]:
import requests
from scrapy.http import TextResponse

url = "http://quotes.toscrape.com/page/1/"

# download the page with requests and wrap it into a Scrapy TextResponse object,
# so that the same css()/xpath() functions as in the Scrapy shell become available
r = requests.get(url)
response = TextResponse(r.url, body=r.text, encoding='utf-8')

Fine, now we are ready to apply Scrapy functions to our response object. All the code that follows is the same both for Jupyter notebook users and for those who chose the command prompt approach.

As we covered before, there are two main ways to navigate an HTML document: CSS selectors and the XPath approach. While BeautifulSoup supported only the former, Scrapy has functions for both: css() for CSS selectors and xpath() for the XPath approach.
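
As a quick preview of this equivalence (both approaches are covered in detail below), the two lines that follow select the same page title; only the query syntax differs:

response.css('title')
response.xpath('//title')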

CSS selectors

Let's use CSS selectors to find the title of the page.


In [2]:
response.css('title')


Out[2]:
[<Selector xpath=u'descendant-or-self::title' data=u'<title>Quotes to Scrape</title>'>]

As you can see, this provides more information than needed. That's why there is an extract() function, which extracts only the component we are interested in, without the additional information. It can be said that css() combined with extract() mimics the findAll() behaviour from BeautifulSoup.


In [3]:
response.css('title').extract()


Out[3]:
[u'<title>Quotes to Scrape</title>']

Excellent! We now have the tag we were looking for, with the text inside. If we want to select only the text content, there is no need for an additional function: one just needs to append the ::text component to the CSS selector, as shown below.


In [4]:
response.css('title::text').extract()


Out[4]:
[u'Quotes to Scrape']

In [5]:
type(response.css('title::text').extract())


Out[5]:
list

As mentioned before, the extract() function applied to a CSS selector mimics the findAll() behaviour. This is also true of the output we receive: it is of type list. If one needs to receive a single element as output, the extract_first() function must be used, which returns the very first matched element (similarly to find() from BeautifulSoup).


In [6]:
response.css('title::text').extract_first()


Out[6]:
u'Quotes to Scrape'

In [7]:
type(response.css('title::text').extract_first())


Out[7]:
unicode
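
A handy detail not shown above: when the selector matches nothing, extract() simply returns an empty list, whereas extract_first() returns None (or a default value, if one is provided). Here is a minimal sketch, using a made-up tag name to force an empty match:

# 'h7' does not exist on the page, so nothing is matched
response.css('h7::text').extract()                            # []
response.css('h7::text').extract_first()                      # None
response.css('h7::text').extract_first(default='not found')   # 'not found'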

Let's now try to find the heading of the page (which is Quotes to Scrape). As usual, the heading is provided inside an <h1> tag.


In [8]:
response.css('h1').extract()


Out[8]:
[u'<h1>\n                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>\n                </h1>']

Again, let's try to get the heading text by using the ::text component.


In [9]:
response.css('h1::text').extract()


Out[9]:
[u'\n                    ', u'\n                ']

The latter did not really help, because the heading text is inside an <a> tag, which in turn is inside the <h1> tag found above. Let's therefore select the <a> tag nested in the <h1>.


In [10]:
response.css('h1 a').extract()


Out[10]:
[u'<a href="/" style="text-decoration: none">Quotes to Scrape</a>']

Nice! We found it. As you can see, it has a style attribute that differentiates this <a> tag from the others (a kind of identifier). We could use it to find this <a> tag without even mentioning that it is inside an <h1> tag. To do this in Scrapy, square brackets should be used.


In [11]:
response.css('a[style="text-decoration: none"]').extract()


Out[11]:
[u'<a href="/" style="text-decoration: none">Quotes to Scrape</a>']

Great! Let's now extract the text first and then the link inside this tag (i.e. the value of the href attribute).


In [12]:
response.css('a[style="text-decoration: none"]::text').extract()


Out[12]:
[u'Quotes to Scrape']
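
By the way, Scrapy's CSS selectors also offer a way to reach the heading text without naming the nested <a> tag at all: the descendant form of ::text. A selector like h1 *::text asks for the text of every element nested inside the <h1>, which here should yield the link text. A small sketch:

response.css('h1 *::text').extract()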

To get the value of the href attribute (and the same goes for any other attribute), the following approach can be used in Scrapy, which can be considered the alternative to the get() function in BeautifulSoup or lxml.


In [13]:
response.css('a[style="text-decoration: none"]::attr(href)').extract()


Out[13]:
[u'/']
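
Note that the extracted link is relative. If an absolute URL is needed (for example, to request that page next), the response object provides a urljoin() helper that resolves a relative link against the current page. A short sketch:

relative_link = response.css('a[style="text-decoration: none"]::attr(href)').extract_first()
response.urljoin(relative_link)   # 'http://quotes.toscrape.com/'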

Scrapy also supports regular expressions, which can be applied directly to the matched response. For example, let's select only the "to Scrape" part of the heading using regular expressions. We just need to substitute the extract() function with the re() function, which takes the expression as an argument.


In [14]:
# expression explanation: match 'Quotes', then a whitespace character, then anything else
# return only the 'anything else' component (the captured group)
response.css('h1 a::text').re('Quotes\s(.*)')


Out[14]:
[u'to Scrape']
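
Just as extract() has an extract_first() counterpart, newer Scrapy versions pair re() with a re_first() function, which returns the first match as a single string rather than a list. A minimal sketch:

response.css('h1 a::text').re_first('Quotes\s(.*)')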

Similarly, we could use RegEx to match and return each word of the heading separately as a list element:


In [15]:
response.css('h1 a::text').re('(\S+)\s(\S+)\s(\S+)')


Out[15]:
[u'Quotes', u'to', u'Scrape']

Perfect, we are now done with the css() function; let's implement the same with xpath().

XPath approach


In [16]:
response.xpath('//title').extract()


Out[16]:
[u'<title>Quotes to Scrape</title>']

To get only the text, the following should be added to the XPath expression: /text()


In [17]:
response.xpath('//title/text()').extract()


Out[17]:
[u'Quotes to Scrape']

Similarly, we can find the <a> tag inside the <h1> and extract first the text and then the link.


In [18]:
response.xpath('//h1/a').extract()


Out[18]:
[u'<a href="/" style="text-decoration: none">Quotes to Scrape</a>']

In [19]:
response.xpath('//h1/a/text()').extract()


Out[19]:
[u'Quotes to Scrape']
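
For completeness, the attribute-based selection we did with square brackets in CSS has a direct XPath counterpart as well: a predicate in square brackets, with @ in front of the attribute name. A sketch mirroring the earlier a[style="text-decoration: none"] example:

response.xpath('//a[@style="text-decoration: none"]/text()').extract()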

The xpath() function operates in the same way as in the lxml package, which means /@href should be added to the path to select the value of the href attribute (i.e. the link).


In [20]:
response.xpath('//h1/a/@href').extract()


Out[20]:
[u'/']
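
As a short wrap-up, the same two functions can be used to pull real data out of the page. The sketch below assumes the markup that quotes.toscrape.com uses at the time of writing (each quote sits in a <div class="quote"> containing a <span class="text"> and a <small class="author">); if the markup changes, the selectors have to be adjusted accordingly:

# all quote texts and authors on the page, once with CSS selectors and once with XPath
quotes_css  = response.css('div.quote span.text::text').extract()
authors_css = response.css('div.quote small.author::text').extract()

quotes_xpath  = response.xpath('//div[@class="quote"]/span[@class="text"]/text()').extract()
authors_xpath = response.xpath('//div[@class="quote"]//small[@class="author"]/text()').extract()

# pair them together into (quote, author) tuples
list(zip(quotes_css, authors_css))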

This is all for Part 1. We just used Scrapy as a library and experienced part of its power: Scrapy is, in a sense, the combination of everything we have learned so far. Yet this is not the only reason Scrapy is powerful and in demand. The rest will be covered in the following parts.

P.S. If you were using the command prompt to run this code, run the exit() command to exit the Scrapy shell. If you want to save your commands into a Python file before exiting, the following command will be of use:

%save my_commands 1-56

where my_commands is the name of the file to be created (change it to your taste) and 1-56 tells IPython to save the code starting from line 1 (the very beginning) and ending with line 56 (put the line number you want here; use the last one if you want to save the whole session).